Nonparametric Bayesian Models for Unsupervised Learning

Author

  • Pu Wang
Abstract

NONPARAMETRIC BAYESIAN MODELS FOR UNSUPERVISED LEARNING
Pu Wang, PhD
George Mason University, 2011
Dissertation Director: Carlotta Domeniconi

Unsupervised learning is an important topic in machine learning. In particular, clustering is an unsupervised learning problem that arises in a variety of applications for data analysis and mining. Unfortunately, clustering is an ill-posed problem and, as such, a challenging one: no ground truth is available to validate clustering results. Two issues arise as a consequence. First, different clustering algorithms embed their own biases, which result from their different optimization criteria. As a result, each algorithm may discover different patterns in a given dataset. The second issue concerns the setting of parameters. In clustering, parameter settings control the characterization of individual clusters and the total number of clusters in the data.

Clustering ensembles have been proposed to address the issue of different biases induced by various algorithms. Clustering ensembles combine different clustering results, and can provide solutions that are robust against spurious elements in the data. Although clustering ensembles provide a significant advance, they do not satisfactorily address the model selection and parameter tuning problems.

Bayesian approaches have been applied to clustering to address the parameter tuning and model selection issues. Bayesian methods provide a principled way to address these problems by assuming prior distributions on model parameters. Prior distributions assign low probabilities to parameter values that are unlikely. They therefore serve as regularizers for model parameters, and can help avoid over-fitting. In addition, Bayesian approaches use the marginal likelihood as the criterion for model selection. Although Bayesian methods provide a principled way to perform parameter tuning and model selection, the key question “How many clusters?” is still open.
This is a fundamental question for model selection. Nonparametric Bayesian approaches have been proposed to address it. Unlike parametric Bayesian models, for which the number of parameters is finite and fixed, nonparametric Bayesian models allow the number of parameters to grow with the number of observations. After observing the data, nonparametric Bayesian models fit the data with finite-dimensional parameters.

An additional issue with clustering is high dimensionality. High-dimensional data pose a difficult challenge to the clustering process. A common scenario with high-dimensional data is that clusters may exist in different subspaces comprising different combinations of features (dimensions). In other words, data points in a cluster may be similar to each other along a subset of dimensions, but not in all dimensions. Subspace clustering techniques, a.k.a. co-clustering or bi-clustering, have been proposed to address the dimensionality issue (here, I use the term co-clustering). Like clustering, co-clustering is ill-posed and lacks ground truth to validate its results.

Although attempts have been made in the literature to address the major issues related to clustering individually, no previous work has addressed them jointly. In my dissertation I propose a unified framework that addresses all three issues at the same time. I designed a nonparametric Bayesian clustering ensemble (NBCE) approach, which assumes that multiple observed clustering results are generated from an unknown consensus clustering. The underlying distribution is assumed to be a mixture distribution with a nonparametric Bayesian prior, i.e., a Dirichlet Process. The number of mixture components, a.k.a. the number of consensus clusters, is learned automatically.
By combining the ensemble methodology and nonparametric Bayesian modeling, NBCE addresses both the ill-posed nature and the parameter setting/model selection issues of clustering. Furthermore, NBCE outperforms individual clustering methods, since it can escape local optima by combining multiple clustering results.

I also designed a nonparametric Bayesian co-clustering ensemble (NBCCE) technique. NBCCE inherits the advantages of NBCE, and in addition it is effective with high-dimensional data. As such, NBCCE provides a unified framework to address all three aforementioned issues. NBCCE assumes that multiple observed co-clustering results are generated from an unknown consensus co-clustering. The underlying distribution is assumed to be a mixture with a nonparametric Bayesian prior. I developed two models to generate co-clusters in terms of row- and column-clusters. In the first, row- and column-clusters are assumed to be independent, and NBCCE assumes two independent Dirichlet Process priors on the hidden consensus co-clustering, one for rows and one for columns. The second model captures the dependence between row- and column-clusters by assuming a Mondrian Process prior on the hidden consensus co-clustering. Combined with Mondrian priors, NBCCE provides more flexibility to fit the data.

I have performed extensive evaluation on relational data and protein-molecule interaction data. The empirical evaluation demonstrates the effectiveness of NBCE and NBCCE and their advantages over traditional clustering and co-clustering methods.
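
The abstract's remark that a Dirichlet Process prior lets the number of clusters be learned rather than fixed in advance can be illustrated with the Chinese restaurant process, a standard way to draw partitions from a Dirichlet Process. The sketch below is only an illustration of that prior, not the NBCE or NBCCE model from the dissertation; the function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced here.

```python
import random


def crp_assignments(n_points, alpha, seed=0):
    """Sample cluster labels from a Chinese restaurant process (illustrative only).

    alpha is a hypothetical concentration parameter (> 0); larger values
    make it more likely that a point opens a new cluster.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of points currently in cluster k
    labels = []
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    for n in (10, 100, 1000):
        _, counts = crp_assignments(n, alpha=1.0)
        print(n, "points ->", len(counts), "clusters")
```

The number of clusters is not an input: it is a by-product of the sampled partition and can only grow as points are added, which is the behaviour the abstract contrasts with parametric models that fix the number of components beforehand.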


Similar resources

Particle Filtering for Nonparametric Bayesian Matrix Factorization

Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distr...


Unsupervised Coreference Resolution in a Nonparametric Bayesian Model

We present an unsupervised, nonparametric Bayesian approach to coreference resolution which models both global entity identity across a corpus as well as the sequential anaphoric structure within each document. While most existing coreference work is driven by pairwise decisions, our model is fully generative, producing each mention from a combination of global entity properties and local atten...


Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars

One of the reasons nonparametric Bayesian inference is attracting attention in computational linguistics is because it provides a principled way of learning the units of generalization together with their probabilities. Adaptor grammars are a framework for defining a variety of hierarchical nonparametric Bayesian models. This paper investigates some of the choices that arise in formulating adap...


Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

We present a sequence of unsupervised, nonparametric Bayesian models for clustering complex linguistic objects. In this approach, we consider a potentially infinite number of features and categorical outcomes. We evaluated these models for the task of within- and cross-document event coreference on two corpora. All the models we investigated show significant improvements when compared against an ...


Introducing of Dirichlet process prior in the Nonparametric Bayesian models frame work

Statistical models are used to learn about the mechanism that generates the data. It is often assumed that the random variables y_i, i=1,…,n, are samples from a probability distribution F that belongs to a parametric class of distributions. However, in practice, a parametric model may be inappropriate to describe the data. In this setting, the parametric assumption could be r...



Publication date: 2011